music information retrieval
Twenty-Five Years of MIR Research: Achievements, Practices, Evaluations, and Future Challenges
Peeters, Geoffroy, Rafii, Zafar, Fuentes, Magdalena, Duan, Zhiyao, Benetos, Emmanouil, Nam, Juhan, Mitsufuji, Yuki
In this paper, we trace the evolution of Music Information Retrieval (MIR) over the past 25 years. While MIR gathers all kinds of research related to music informatics, a large part of it focuses on signal processing techniques for music data, fostering a close relationship with the IEEE Audio and Acoustic Signal Processing Technical Commitee. In this paper, we reflect the main research achievements of MIR along the three EDICS related to music analysis, processing and generation. We then review a set of successful practices that fuel the rapid development of MIR research. One practice is the annual research benchmark, the Music Information Retrieval Evaluation eXchange, where participants compete on a set of research tasks. Another practice is the pursuit of reproducible and open research. The active engagement with industry research and products is another key factor for achieving large societal impacts and motivating younger generations of students to join the field. Last but not the least, the commitment to diversity, equity and inclusion ensures MIR to be a vibrant and open community where various ideas, methodologies, and career pathways collide. We finish by providing future challenges MIR will have to face.
Pretrained Conformers for Audio Fingerprinting and Retrieval
Altwlkany, Kemal, Selmanovic, Elmedin, Delalic, Sead
Conformers have shown great results in speech processing due to their ability to capture both local and global interactions. In this work, we utilize a self-supervised contrastive learning framework to train conformer-based encoders that are capable of generating unique embeddings for small segments of audio, generalizing well to previously unseen data. We achieve state-of-the-art results for audio retrieval tasks while using only 3 seconds of audio to generate embeddings. Our models are almost completely immune to temporal misalignments and achieve state-of-the-art results in cases of other audio distortions such as noise, reverb or extreme temporal stretching. Code and models are made publicly available and the results are easy to reproduce as we train and test using popular and freely available datasets of different sizes.
MAIA: An Inpainting-Based Approach for Music Adversarial Attacks
Liu, Yuxuan, Zhang, Peihong, Sang, Rui, Li, Zhixin, Li, Shengchen
Music adversarial attacks have garnered significant interest in the field of Music Information Retrieval (MIR). In this paper, we present Music Adversarial Inpainting Attack (MAIA), a novel adversarial attack framework that supports both white-box and black-box attack scenarios. MAIA begins with an importance analysis to identify critical audio segments, which are then targeted for modification. Utilizing generative inpainting models, these segments are reconstructed with guidance from the output of the attacked model, ensuring subtle and effective adversarial perturbations. We evaluate MAIA on multiple MIR tasks, demonstrating high attack success rates in both white-box and black-box settings while maintaining minimal perceptual distortion. Additionally, subjective listening tests confirm the high audio fidelity of the adversarial samples. Our findings highlight vulnerabilities in current MIR systems and emphasize the need for more robust and secure models.
Layer-wise Investigation of Large-Scale Self-Supervised Music Representation Models
Zhou, Yizhi, Zhu, Haina, Chen, Hangting
Recently, pre-trained models for music information retrieval based on self-supervised learning (SSL) are becoming popular, showing success in various downstream tasks. However, there is limited research on the specific meanings of the encoded information and their applicability. Exploring these aspects can help us better understand their capabilities and limitations, leading to more effective use in downstream tasks. In this study, we analyze the advanced music representation model MusicFM and the newly emerged SSL model MuQ. We focus on three main aspects: (i) validating the advantages of SSL models across multiple downstream tasks, (ii) exploring the specialization of layer-wise information for different tasks, and (iii) comparing performance differences when selecting specific layers. Through this analysis, we reveal insights into the structure and potential applications of SSL models in music information retrieval.
Learning Music Audio Representations With Limited Data
Plachouras, Christos, Benetos, Emmanouil, Pauwels, Johan
--Large deep-learning models for music, including those focused on learning general-purpose music audio representations, are often assumed to require substantial training data to achieve high performance. If true, this would pose challenges in scenarios where audio data or annotations are scarce, such as for underrepresented music traditions, non-popular genres, and personalized music creation and listening. Understanding how these models behave in limited-data scenarios could be crucial for developing techniques to tackle them. In this work, we investigate the behavior of several music audio representation models under limited-data learning regimes. We consider music models with various architectures, training paradigms, and input durations, and train them on data collections ranging from 5 to 8,000 minutes long. We evaluate the learned representations on various music information retrieval tasks and analyze their robustness to noise. We show that, under certain conditions, representations from limited-data and even random models perform comparably to ones from large-dataset models, though handcrafted features outperform all learned representations in some tasks. Large models tackling audio-based Music Information Retrieval (MIR) tasks are commonly thought to require substantial amounts of training data [1], [2].
CrossMuSim: A Cross-Modal Framework for Music Similarity Retrieval with LLM-Powered Text Description Sourcing and Mining
Tsoi, Tristan, Deng, Jiajun, Ju, Yaolong, Weck, Benno, Kirchhoff, Holger, Lui, Simon
--Music similarity retrieval is fundamental for managing and exploring relevant content from large collections in streaming platforms. This paper presents a novel cross-modal contrastive learning framework that leverages the open-ended nature of text descriptions to guide music similarity modeling, addressing the limitations of traditional uni-modal approaches in capturing complex musical relationships. T o overcome the scarcity of high-quality text-music paired data, this paper introduces a dual-source data acquisition approach combining online scraping and LLM-based prompting, where carefully designed prompts leverage LLMs' comprehensive music knowledge to generate contextually rich descriptions. Extensive experiments demonstrate that the proposed framework achieves significant performance improvements over existing benchmarks through objective metrics, subjective evaluations, and real-world A/B testing on the Huawei Music streaming platform. Music similarity retrieval plays an important role in many music information retrieval (MIR) tasks, such as music recommendation [1], personalized playlist generation [2] and background music replacement in video editing [3], [4]. As digital music collections rapidly expand within streaming platforms, accurately identifying similarities between musical pieces has become critical for managing and exploring relevant content from such large collections efficiently.
HARP 2.0: Expanding Hosted, Asynchronous, Remote Processing for Deep Learning in the DAW
Benetatos, Christodoulos, Cwitkowitz, Frank, Pruyne, Nathan, Garcia, Hugo Flores, O'Reilly, Patrick, Duan, Zhiyao, Pardo, Bryan
HARP 2.0 brings deep learning models to digital audio workstation (DAW) software through hosted, asynchronous, remote processing, allowing users to route audio from a plug-in interface through any compatible Gradio endpoint to perform arbitrary transformations. HARP renders endpoint-defined controls and processed audio in-plugin, meaning users can explore a variety of cutting-edge deep learning models without ever leaving the DAW. In the 2.0 release we introduce support for MIDI-based models and audio/MIDI labeling models, provide a streamlined pyharp Python API for model developers, and implement numerous interface and stability improvements. Through this work, we hope to bridge the gap between model developers and creatives, improving access to deep learning models by seamlessly integrating them into DAW workflows.
Linguistic Analysis of Sinhala YouTube Comments on Sinhala Music Videos: A Dataset Study
De Mel, W. M. Yomal, de Silva, Nisansa
This research investigates the area of Music Information Retrieval (MIR) and Music Emotion Recognition (MER) in relation to Sinhala songs, an underexplored field in music studies. The purpose of this study is to analyze the behavior of Sinhala comments on YouTube Sinhala song videos using social media comments as primary data sources. These included comments from 27 YouTube videos containing 20 different Sinhala songs, which were carefully selected so that strict linguistic reliability would be maintained and relevancy ensured. This process led to a total of 93,116 comments being gathered upon which the dataset was refined further by advanced filtering methods and transliteration mechanisms resulting into 63,471 Sinhala comments. Additionally, 964 stop-words specific for the Sinhala language were algorithmically derived out of which 182 matched exactly with English stop-words from NLTK corpus once translated. Also, comparisons were made between general domain corpora in Sinhala against the YouTube Comment Corpus in Sinhala confirming latter as good representation of general domain. The meticulously curated data set as well as the derived stop-words form important resources for future research in the fields of MIR and MER, since they could be used and demonstrate that there are possibilities with computational techniques to solve complex musical experiences across varied cultural traditions
Hidden Echoes Survive Training in Audio To Audio Generative Instrument Models
Tralie, Christopher J., Amery, Matt, Douglas, Benjamin, Utz, Ian
In our work, though, we do not seek to influence As generative techniques pervade the audio domain, the behavior of the model so drastically, but rather to there has been increasing interest in tracing back through "tag" the data in such a way that the model reproduces these complicated models to understand how they draw the tag, similarly to how [10] watermark their training on their training data to synthesize new examples, both data for a diffusion image model. We are also inspired to ensure that they use properly licensed data and also to by the recent lawsuit by Getty Images against Stable elucidate their black box behavior. In this paper, we show Diffusion when it was discovered that the latter would that if imperceptible echoes are hidden in the training often reproduce the former's watermarks in its output data, a wide variety of audio to audio architectures (differentiable [29]. We would like to do something similar with audio, digital signal processing (DDSP), Realtime but to keep it imperceptible. Audio Variational autoEncoder (RAVE), and "Dance All of the above approaches use neural networks to Diffusion") will reproduce these echoes in their outputs.
CHORDONOMICON: A Dataset of 666,000 Songs and their Chord Progressions
Kantarelis, Spyridon, Thomas, Konstantinos, Lyberatos, Vassilis, Dervakos, Edmund, Stamou, Giorgos
Chord progressions encapsulate important information about music, pertaining to its structure and conveyed emotions. They serve as the backbone of musical composition, and in many cases, they are the sole information required for a musician to play along and follow the music. Despite their importance, chord progressions as a data domain remain underexplored. There is a lack of large-scale datasets suitable for deep learning applications, and limited research exploring chord progressions as an input modality. In this work, we present Chordonomicon, a dataset of over 666,000 songs and their chord progressions, annotated with structural parts, genre, and release date - created by scraping various sources of user-generated progressions and associated metadata. We demonstrate the practical utility of the Chordonomicon dataset for classification and generation tasks, and discuss its potential to provide valuable insights to the research community. Chord progressions are unique in their ability to be represented in multiple formats (e.g. text, graph) and the wealth of information chords convey in given contexts, such as their harmonic function . These characteristics make the Chordonomicon an ideal testbed for exploring advanced machine learning techniques, including transformers, graph machine learning, and hybrid systems that combine knowledge representation and machine learning.